Abstract. We study the asymptotic distribution of the displacements
in hashing with coalesced chains, for both late-insertion and early-insertion.
Asymptotic formulas for means and variances follow. The method
uses Poissonization and some stochastic calculus.



1. Introduction 

The standard version of hashing with coalesced chains, due to Williams
[10], can be described as follows, where $n$ and $m$ are integers with $0 \le n \le m$.
(See further Knuth [6, Section 6.4, in particular Algorithm 6.4.C] and the
monograph by Vitter and Chen [9].)

$n$ items $x_1, \dots, x_n$ are placed sequentially into a table with
$m$ cells $1, \dots, m$, using $n$ integers $h_i \in \{1, \dots, m\}$. Each cell
contains a link field, initially null. Item $x_i$ is inserted into
cell $h_i$ if it is empty; otherwise we follow the links from cell
$h_i$ until we reach the end of the chain (signalled by a null
link), we add a link to an empty cell (which is chosen as the
empty cell with largest index) and store the item there.

For our probabilistic treatment, we assume that each of the $m^n$ possible
hash sequences $(h_i)_1^n$ is equally likely; in other words, the hash addresses $h_i$
are independent random numbers, uniformly distributed on $\{1, \dots, m\}$.

The displacement $d_i$ of an item $x_i$ is the number of links we have to
follow from $h_i$ until we find $x_i$. Large displacements make both insertion
and searching less efficient, so it is desirable to keep the displacements small.
(Two different but related quantities are used in other papers to measure
the efficiency: The number of probes to find the item $x_i$ in the table is $d_i + 1$.
The number of key comparisons to find the item is also $d_i + 1$. This should
be noted when comparing the results below with other papers.)

The items are thus arranged in linked chains in the hash table. If a new 
item hashes to an empty cell, a new chain with that single item is created. 
If a new item hashes to a cell in an existing chain, then that chain grows by 
addition of a formerly empty cell. It was shown by Chen and Vitter [3] and
Knott [5] that the average performance could be improved by modifying the
algorithm above, inserting the new item at a different place in an existing
chain. We will therefore study two versions of hashing with coalesced chains:

L  Late-insertion (LISCH). The standard version described above, where
   the new item is inserted last in its chain.
E  Early-insertion (EISCH) [3], [5], [9]. If cell $h_i$ is occupied, item $x_i$ is
   inserted into an empty cell as above, but this cell is linked into the
   chain immediately after $h_i$. (I.e., if the first free cell is $j$ and the link
   from $h_i$ points to $k$ (null or not), then this link is reset to $j$, and the
   link field in $j$ is set to $k$.) This method gives the smallest average
   displacement among all possible insertion schemes [9, Theorem 5.2].

Note that the insertion of a sequence of items results in the same set of 
occupied cells in both versions, and that this set is partitioned into chains 
in the same way, but that the order in the chains, and thus the individual 
displacements, may differ. 
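
The two insertion rules can be made concrete with a small simulation. The following is a minimal sketch, not taken from the paper: the function names `build_table` and `displacements`, the 0-based item indices, and the use of `0` as the null link are assumptions made for illustration only.

```python
def build_table(m, hashes, early=False):
    """Insert items 0, 1, ... with hash addresses `hashes` into cells 1..m.

    Requires len(hashes) <= m. Returns (item, link), where item[c] is the
    item stored in cell c (None if empty) and link[c] is the next cell in
    the chain (0 is used here as the null link; this is an assumption).
    """
    item = [None] * (m + 1)      # index 0 is unused; cells are 1..m
    link = [0] * (m + 1)
    free = m                     # scans downwards for the empty cell of largest index
    for i, h in enumerate(hashes):
        if item[h] is None:      # hash address is empty: start a new chain
            item[h] = i
            continue
        while item[free] is not None:
            free -= 1
        item[free] = i
        if early:                # EISCH: splice the new cell in right after h
            link[free] = link[h]
            link[h] = free
        else:                    # LISCH: append the new cell at the end of the chain
            c = h
            while link[c] != 0:
                c = link[c]
            link[c] = free
    return item, link


def displacements(hashes, item, link):
    """Number of links followed from h_i until item i is found, for each i."""
    result = []
    for i, h in enumerate(hashes):
        c, steps = h, 0
        while item[c] != i:
            c = link[c]
            steps += 1
        result.append(steps)
    return result
```

With three items all hashing to cell 1 in a table of 5 cells, late insertion builds the chain 1, 5, 4 in insertion order, while early insertion reorders it to 1, 4, 5; the occupied cells are the same in both cases.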

Our main result is Theorem 2.1 below (together with its refinement
Theorem 2.5), which gives the asymptotic distribution of the displacements in
a random hash table under both insertion methods; we consider also the
case of unsuccessful searches. As corollaries we easily find earlier known
asymptotic formulas for the average displacements and for the variances of
them (some of the latter may be new). These asymptotic distributions are
studied further in Section 3; some numerical values are given in Table 1. The
proofs are given in Section 4. They are based on Poissonization, regarding
the items as arriving at random times.

Remark 1.1. An interesting variation of the algorithm above ([6, Exercise
6.4-43], discussed in detail by Vitter and Chen [9]) is to choose a number
$m' < m$ and reserve the last $m - m'$ cells as a "cellar" for the undisturbed
growth of the chains. We then assume that the hash addresses
$h_i \in \{1, \dots, m'\}$, and use exactly the same algorithms as above. (These
versions are called LICH and EICH in [9].) It is shown in [5] that for given
$n$ and $m$, a suitable choice of $m'$ will improve the average performance.

In this setting, it is also interesting to consider a third version,
varied-insertion (VICH) [9], which behaves like E except that when the chain from
the hash address contains a cellar cell, the new item is inserted after the last
cellar cell. It is shown in [9, Chapter 5] that this method gives the minimum
average among all insertion methods satisfying a weak assumption.

We have not yet investigated the versions with cellar in detail, but it
seems that our methods could be used with some additional work to find
the asymptotic distributions of the displacements in these cases too. (The
means are given in [9].)

Remark 1.2. Corresponding results for hashing with linear probing are
given by Janson [4] and Viola [8]. Note that, as remarked in [4], the average
displacement for linear probing tends to infinity if $n, m \to \infty$ with $n/m \to 1$,
while for the chained hashing studied here, it stays bounded also in the
extreme case $n = m$ of a full table.

2. Notation and results

By a hash table $T$ we mean not only the final table, but also its
construction history; moreover, we consider the two versions above together.
Formally, a hash table can be regarded as encoded by the numbers $m$ and
$n$ and the sequence $(h_1, \dots, h_n)$ of hash addresses.

Our prime object of study is the random hash table $T_{m,n}$ with $m$ cells
and $n$ items ($0 \le n \le m$) and the hash addresses $h_1, \dots, h_n$ i.i.d. random
variables, uniformly distributed on $\{1, \dots, m\}$.

We denote the two insertion policies defined in the introduction by L and
E, and use $\Xi$ to denote any of these.

Given a hash table $T$ (with $m$ cells and $n$ items), random or not, and a
policy $\Xi \in \{L, E\}$, we let $d_i^\Xi(T)$ be the (final) displacement of the $i$:th item,
$1 \le i \le n$, and
\[
  n_k^\Xi(T) := \#\{i : d_i^\Xi(T) = k\}, \qquad k = 0, 1, \dots,
\]
the number of items with displacement $k$. Note that
\[
  \sum_k n_k^\Xi(T) = n. \tag{2.1}
\]
If $n > 0$, we let $d^\Xi(T)$ denote a randomly chosen displacement in a given
hash table $T$ using policy $\Xi$, i.e. the random variable $d_I^\Xi(T)$ where
$I \in \{1, \dots, n\}$ is a random index with a uniform distribution. Thus, given $T$,
$d^\Xi(T)$ has the distribution
\[
  \mathbb{P}(d^\Xi(T) = k \mid T) = \frac{1}{n}\, n_k^\Xi(T). \tag{2.2}
\]

Similarly, we let $d_j^U(T)$ denote the number of occupied cells encountered in
an unsuccessful search starting at hash address $j$, $1 \le j \le m$, and let $d^U(T)$
denote the number of occupied cells encountered in a random unsuccessful
search, i.e. $d^U(T) := d_J^U(T)$, where $J \in \{1, \dots, m\}$ is a uniformly distributed
random index. (Note that in an unsuccessful search starting at $j$, the number
of key comparisons equals $d_j^U$, while the number of probes, $\tilde d_j^U$ say, is $d_j^U$ if
$d_j^U \ge 1$ and 1 if $d_j^U = 0$; i.e., $\tilde d_j^U := \max(d_j^U, 1)$.) We further let
\[
  n_k^U(T) := \#\{j : d_j^U(T) = k\}, \qquad k = 0, 1, \dots,
\]
and note that now, in contrast to (2.1),
\[
  \sum_k n_k^U(T) = m. \tag{2.3}
\]
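
The counts $n_k^U$ can be computed directly from a table. The sketch below is illustrative only: the name `unsuccessful_counts` is hypothetical, the table is built with late insertion for concreteness, and `0` stands in for the null link.

```python
from collections import Counter


def unsuccessful_counts(m, hashes):
    """Return a Counter mapping k to n_k^U for the table built from `hashes`.

    Cells are 1..m and len(hashes) <= m is assumed. Only the chain structure
    matters here, so the items themselves are not stored.
    """
    occupied = [False] * (m + 1)
    link = [0] * (m + 1)          # 0 = null link (an assumption of this sketch)
    free = m
    for h in hashes:              # late insertion
        if not occupied[h]:
            occupied[h] = True
            continue
        while occupied[free]:
            free -= 1
        occupied[free] = True
        c = h
        while link[c]:
            c = link[c]
        link[c] = free
    counts = Counter()
    for j in range(1, m + 1):
        d = 0
        if occupied[j]:           # count cell j and every cell after it in its chain
            d, c = 1, j
            while link[c]:
                c = link[c]
                d += 1
        counts[d] += 1
    return counts
```

For any input the counts add up to $m$, as in (2.3), and the count for $k = 0$ is the number $m - n$ of empty cells.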






SVANTE JANSON 



Thus, cf. (2.2), given $T$, $d^U(T)$ has the distribution
\[
  \mathbb{P}(d^U(T) = k \mid T) = \frac{1}{m}\, n_k^U(T).
\]

Our main result is the following theorem, giving the asymptotic
distribution of the displacement of a random item in a random hash table $T_{m,n}$,
together with the corresponding quantity for an unsuccessful search.

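Theorem 2.1. Suppose that $m, n \to \infty$ with $0 \le n \le m$ and $n/m \to \alpha \in [0, 1]$.
Then, for every $k = 0, 1, \dots$,
(i)
\[
  \mathbb{P}(d^U(T_{m,n}) = k) = \frac{1}{m}\,\mathbb{E}(n_k^U(T_{m,n}))
  \to p_\alpha^U(k) :=
  \begin{cases}
    1 - \alpha, & k = 0,\\
    \int_0^\alpha (1 - \alpha + t)(1 - e^{-t})^{k-1}\,dt, & k \ge 1;
  \end{cases}
\]
(ii)
\[
  \mathbb{P}(d^L(T_{m,n}) = k) = \frac{1}{n}\,\mathbb{E}(n_k^L(T_{m,n}))
  \to p_\alpha^L(k) :=
  \begin{cases}
    1 - \alpha/2, & k = 0,\\
    \frac{1}{\alpha}\int_0^\alpha \bigl(\alpha - t - (\alpha - t)^2/2\bigr)(1 - e^{-t})^{k-1}\,dt, & k \ge 1;
  \end{cases}
\]
(iii)
\[
  \mathbb{P}(d^E(T_{m,n}) = k) = \frac{1}{n}\,\mathbb{E}(n_k^E(T_{m,n}))
  \to p_\alpha^E(k) :=
  \begin{cases}
    1 - \alpha/2, & k = 0,\\
    \frac{1}{\alpha}\int_0^\alpha (\alpha - t)\,e^{-t}(1 - e^{-t})^{k-1}\,dt, & k \ge 1.
  \end{cases}
\]
(For $\alpha = 0$, $p_0^L(k) = p_0^E(k) = 0$ when $k \ge 1$.) For every $\Xi \in \{L, E, U\}$ and
$\alpha \in [0, 1]$, $\{p_\alpha^\Xi(k)\}_{k=0}^\infty$ is a probability distribution on $\mathbb{N}$. If $D_\alpha^\Xi$ is a random
variable with this distribution, i.e. $\mathbb{P}(D_\alpha^\Xi = k) = p_\alpha^\Xi(k)$, then these results
can be written
\[
  d^\Xi(T_{m,n}) \xrightarrow{d} D_\alpha^\Xi. \tag{2.4}
\]
Moreover, all moments converge in (2.4), i.e., $\mathbb{E}(d^\Xi(T_{m,n}))^r \to \mathbb{E}(D_\alpha^\Xi)^r$ for
every $r > 0$.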
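
The limit probabilities are one-dimensional integrals and are easy to evaluate numerically. The following is a rough sketch under stated assumptions: the helper names `simpson` and `p_limit` are hypothetical, a plain composite Simpson rule is used rather than any method from the paper, and $0 < \alpha \le 1$ is assumed.

```python
import math


def simpson(f, a, b, n=2000):
    """Composite Simpson rule on [a, b] with n (even) subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def p_limit(policy, alpha, k):
    """Limit probability p_alpha(k) for policy 'U', 'L' or 'E', 0 < alpha <= 1."""
    if k == 0:
        return 1 - alpha if policy == "U" else 1 - alpha / 2

    def growth(t):
        return (1 - math.exp(-t)) ** (k - 1)

    if policy == "U":
        return simpson(lambda t: (1 - alpha + t) * growth(t), 0.0, alpha)
    if policy == "L":
        return simpson(
            lambda t: (alpha - t - (alpha - t) ** 2 / 2) * growth(t), 0.0, alpha
        ) / alpha
    if policy == "E":
        return simpson(
            lambda t: (alpha - t) * math.exp(-t) * growth(t), 0.0, alpha
        ) / alpha
    raise ValueError("policy must be 'U', 'L' or 'E'")
```

Summing `p_limit(policy, alpha, k)` over a few dozen values of `k` gives a quick numerical check that each family is a probability distribution.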

It follows immediately that for the number of probes in an unsuccessful
search, we have
\[
  \tilde d^U(T_{m,n}) \xrightarrow{d} \tilde D_\alpha^U := \max(D_\alpha^U, 1), \tag{2.5}
\]
again with convergence of all moments.

As a corollary, we find the asymptotics for the expectations; these have
earlier been derived, together with exact formulas for $\mathbb{E}\,d^\Xi(T_{m,n})$, by Knuth
[6] and Vitter and Chen [9] (in equivalent forms for the number of probes
or key comparisons).
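Corollary 2.2. Suppose that $m, n \to \infty$ with $0 \le n \le m$ and $n/m \to \alpha \in [0, 1]$.
Then
\[
  \mathbb{E}(d^U(T_{m,n})) \to \mathbb{E} D_\alpha^U = \frac{1}{4}\bigl(e^{2\alpha} - 1\bigr) + \frac{\alpha}{2},
\]
\[
  \mathbb{E}(d^L(T_{m,n})) \to \mathbb{E} D_\alpha^L = \frac{1}{8\alpha}\bigl(e^{2\alpha} - 1\bigr) + \frac{\alpha}{4} - \frac{1}{4},
\]
\[
  \mathbb{E}(d^E(T_{m,n})) \to \mathbb{E} D_\alpha^E = \frac{1}{\alpha}\bigl(e^{\alpha} - 1 - \alpha\bigr).
\]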
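
As a quick sanity check, the three limiting means can be coded directly and compared with the row of means in Table 1. This is only an illustrative sketch; the function names are hypothetical, and $\alpha > 0$ is assumed for L and E.

```python
import math


def mean_U(alpha):
    """Limit of E d^U: (e^{2a} - 1)/4 + a/2."""
    return (math.exp(2 * alpha) - 1) / 4 + alpha / 2


def mean_L(alpha):
    """Limit of E d^L: (e^{2a} - 1)/(8a) + a/4 - 1/4, for a > 0."""
    return (math.exp(2 * alpha) - 1) / (8 * alpha) + alpha / 4 - 0.25


def mean_E(alpha):
    """Limit of E d^E: (e^a - 1 - a)/a, for a > 0."""
    return (math.exp(alpha) - 1 - alpha) / alpha


for a in (0.5, 1.0):
    print(a, round(mean_U(a), 4), round(mean_L(a), 4), round(mean_E(a), 4))
```

For a full table ($\alpha = 1$) early insertion gives a limiting mean of $e - 2$, somewhat below the late-insertion value $(e^2 - 1)/8$.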

Remark 2.3. Note that the expected number of probes is $\mathbb{E}(d^L(T_{m,n})) + 1$
or $\mathbb{E}(d^E(T_{m,n})) + 1$ for a successful search, and, cf. (2.5),
\[
  \mathbb{E}(\tilde d^U(T_{m,n})) = \mathbb{E}(d^U(T_{m,n})) + \frac{m - n}{m}
  \to \mathbb{E}(\tilde D_\alpha^U) = \mathbb{E}(D_\alpha^U) + 1 - \alpha
\]
for an unsuccessful search.



Theorem 2.1 similarly yields asymptotic formulas for higher moments too;
in particular we have the following results for the variance.

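Corollary 2.4. Suppose that $m, n \to \infty$ with $0 \le n \le m$ and $n/m \to \alpha \in [0, 1]$.
Then
\[
  \operatorname{Var}(d^U(T_{m,n})) \to \operatorname{Var} D_\alpha^U
  = -\frac{1}{16} e^{4\alpha} + \frac{4}{9} e^{3\alpha}
    - \Bigl(\frac{\alpha}{4} + \frac{1}{8}\Bigr) e^{2\alpha}
    - \frac{\alpha^2}{4} + \frac{5}{12}\alpha - \frac{37}{144},
\]
\[
  \operatorname{Var}(d^L(T_{m,n})) \to \operatorname{Var} D_\alpha^L
  = -\frac{1}{64}\Bigl(\frac{e^{2\alpha} - 1}{\alpha}\Bigr)^2
    + \frac{64 e^{2\alpha} + 37 e^{\alpha} + 37}{432} \cdot \frac{e^{\alpha} - 1}{\alpha}
    - \frac{1}{16} e^{2\alpha} - \frac{\alpha^2}{16} + \frac{5}{24}\alpha - \frac{7}{36},
\]
\[
  \operatorname{Var}(d^E(T_{m,n})) \to \operatorname{Var} D_\alpha^E
  = \frac{\alpha - 2}{2}\Bigl(\frac{e^{\alpha} - 1}{\alpha}\Bigr)^2
    + 2\,\frac{e^{\alpha} - 1}{\alpha} - 1,
\]
and, for the number of probes in an unsuccessful search,
\[
  \operatorname{Var}(\tilde d^U(T_{m,n})) \to \operatorname{Var}(\tilde D_\alpha^U)
  = -\frac{1}{16} e^{4\alpha} + \frac{4}{9} e^{3\alpha}
    + \Bigl(\frac{\alpha}{4} - \frac{5}{8}\Bigr) e^{2\alpha}
    - \frac{\alpha^2}{4} - \frac{\alpha}{12} + \frac{35}{144}.
\]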

The asymptotic formula for $\operatorname{Var}(\tilde d^U(T_{m,n}))$, together with an exact
formula, is given in Knuth [6, Answer 6.4-40] and in Vitter and Chen [9]; the
corresponding results for $\operatorname{Var}(d^U(T_{m,n}))$ follow easily. (The numerical result
in [6, Answer 6.4-40] and [9] for the case $\alpha = 1$, when $\tilde D_1^U = D_1^U$, should be
2.65.) The asymptotics of $\operatorname{Var}(d^L(T_{m,n}))$ are given in [9]. We do not know
whether the asymptotics of $\operatorname{Var}(d^E(T_{m,n}))$ have been published earlier.
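
The limiting variances are straightforward to evaluate. The sketch below is illustrative and uses hypothetical function names; it codes the closed forms for $\operatorname{Var} D_\alpha^U$, $\operatorname{Var} D_\alpha^L$ and $\operatorname{Var} D_\alpha^E$, and obtains the variance of the number of probes from the identity $\operatorname{Var}(\tilde D_\alpha^U) = \operatorname{Var}(D_\alpha^U) - 2(1-\alpha)\,\mathbb{E}(D_\alpha^U) + \alpha - \alpha^2$ used in the proof of Corollary 2.4. It assumes $\alpha > 0$ for L and E.

```python
import math


def mean_U(a):
    return (math.exp(2 * a) - 1) / 4 + a / 2


def var_U(a):
    e2 = math.exp(2 * a)
    return (-e2 ** 2 / 16 + 4 * math.exp(3 * a) / 9 - (a / 4 + 1 / 8) * e2
            - a ** 2 / 4 + 5 * a / 12 - 37 / 144)


def var_L(a):
    q = (math.exp(a) - 1) / a
    return (-((math.exp(2 * a) - 1) / a) ** 2 / 64
            + (64 * math.exp(2 * a) + 37 * math.exp(a) + 37) / 432 * q
            - math.exp(2 * a) / 16 - a ** 2 / 16 + 5 * a / 24 - 7 / 36)


def var_E(a):
    q = (math.exp(a) - 1) / a
    return (a - 2) / 2 * q ** 2 + 2 * q - 1


def var_U_probes(a):
    # Variance of max(D^U, 1), via the identity quoted in the lead-in.
    return var_U(a) - 2 * (1 - a) * mean_U(a) + a - a ** 2
```

At $\alpha = 1$ the table is full, no search hits an empty cell, and `var_U_probes(1.0)` coincides with `var_U(1.0)`, approximately 2.65.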

Consider a computer program where a large hash table is constructed 
once, and then used many times for finding the items. We assume that each 
item in the table is equally likely to be requested, and that each choice is 
independent of the previous ones. We therefore have two levels of
randomness: First we construct a random hash table $T$ with some displacements
$(d_i)$. Keeping $T$ fixed and choosing a random index $I \in \{1, \dots, n\}$, we
obtain the random displacement $d(T) = d_I$. As the program runs with
many searches in the hash table, the search times then are (functions of)
independent observations of this random variable. It is thus interesting to
study the distribution of this random variable and its properties such as its
mean and variance. Note that this distribution depends on the hash table
$T$, which is itself random; another run of the program yields another $T$
and another set of displacements. Hence the distribution of the displacement
$d(T)$ is a random distribution and its mean $\mathbb{E}(d(T) \mid T)$ and variance
$\operatorname{Var}(d(T) \mid T) = \mathbb{E}(d(T)^2 \mid T) - \mathbb{E}(d(T) \mid T)^2$ are random variables. In other
words, we study the conditional distribution of $d(T)$ given $T$.

We can refine the results above by conditioning on $T_{m,n}$. The following
theorem says that we still have the same limits, now with convergence in
probability. In other words, different realizations of $T_{m,n}$ have (with large
probability) almost the same distribution of the displacements, so a typical
instance of the random hash table $T_{m,n}$ has its displacements distributed as
the average studied in Theorem 2.1.

Theorem 2.5. Suppose that $m, n \to \infty$ with $0 \le n \le m$ and $n/m \to \alpha \in [0, 1]$.
Then, for every $k = 0, 1, \dots$, with $p_\alpha^\Xi(k)$ defined in Theorem 2.1,

(i) $\mathbb{P}(d^U(T_{m,n}) = k \mid T_{m,n}) = m^{-1} n_k^U(T_{m,n}) \xrightarrow{p} p_\alpha^U(k)$,

(ii) $\mathbb{P}(d^L(T_{m,n}) = k \mid T_{m,n}) = n^{-1} n_k^L(T_{m,n}) \xrightarrow{p} p_\alpha^L(k)$,

(iii) $\mathbb{P}(d^E(T_{m,n}) = k \mid T_{m,n}) = n^{-1} n_k^E(T_{m,n}) \xrightarrow{p} p_\alpha^E(k)$.

Remark 2.6. A more fancy formulation of Theorem 2.5 is that the conditional
distribution of $d^\Xi(T_{m,n})$ given $T_{m,n}$ converges to $p_\alpha^\Xi$ in probability, in the space
of all probability measures on $\mathbb{N}$, equipped with the weak topology (which
coincides with the $\ell^1$ topology on this space); see [2] for definitions.

Moment convergence holds in Theorem 2.5 too, i.e. conditioned on $T_{m,n}$.

Theorem 2.7. Suppose that $m, n \to \infty$ with $0 \le n \le m$ and $n/m \to \alpha \in [0, 1]$.
Then, for every $r > 0$ and $\Xi \in \{L, E, U\}$,
\[
  \mathbb{E}\bigl(d^\Xi(T_{m,n})^r \mid T_{m,n}\bigr) \xrightarrow{p} \mathbb{E}(D_\alpha^\Xi)^r.
\]
In particular, the conditional mean and variance, given the hash table,
converge in probability to the limits in Corollaries 2.2 and 2.4.

3. The asymptotic distributions 

We give some further results on the probability distributions $p_\alpha^\Xi(k)$ defined
in Theorem 2.1; we assume $\alpha > 0$. We omit the proofs. (Several of the
results below were obtained with the help of Maple.)

It follows directly from the definitions in Theorem 2.1 that
\[
  p_\alpha^\Xi(k) = O\bigl((1 - e^{-\alpha})^k\bigr) = O\bigl((1 - e^{-1})^k\bigr);
\]
hence the probabilities decrease geometrically. More refined asymptotics
can easily be derived (we omit the details); we have, as $k \to \infty$, for $\alpha > 0$,
\[
  p_\alpha^L(k) \sim \alpha^{-1} e^{\alpha} (e^{\alpha} - 1)\, k^{-2} (1 - e^{-\alpha})^k,
\]
\[
  p_\alpha^E(k) \sim \alpha^{-1} (e^{\alpha} - 1)\, k^{-2} (1 - e^{-\alpha})^k.
\]
In particular, the probability of an extremely large displacement is about $e^\alpha$
times as large for late-insertion as for early-insertion.

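The probability generating functions for $D_\alpha^\Xi$ also follow easily from the
formulas in Theorem 2.1:
\[
  \mathbb{E}\,x^{D_\alpha^U} = \sum_{k=0}^\infty p_\alpha^U(k)\,x^k
  = 1 - \alpha + x \int_0^\alpha \frac{1 - \alpha + t}{1 - x + x e^{-t}}\,dt,
\]
\[
  \mathbb{E}\,x^{D_\alpha^L} = \sum_{k=0}^\infty p_\alpha^L(k)\,x^k
  = 1 - \frac{\alpha}{2} + \frac{x}{\alpha} \int_0^\alpha \frac{\alpha - t - (\alpha - t)^2/2}{1 - x + x e^{-t}}\,dt,
\]
\[
  \mathbb{E}\,x^{D_\alpha^E} = \sum_{k=0}^\infty p_\alpha^E(k)\,x^k
  = 1 - \frac{\alpha}{2} + \frac{x}{\alpha} \int_0^\alpha \frac{(\alpha - t)\,e^{-t}}{1 - x + x e^{-t}}\,dt.
\]
These integrals can be evaluated in terms of the dilog function (and for L also
polylog), but we do not know any simple form. The generating functions
are analytic for $|x| < r(\alpha) := (1 - e^{-\alpha})^{-1}$, with a singularity at $r(\alpha)$.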

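The integrals defining $p_\alpha^\Xi(k)$ are easily evaluated for small $k$. We find, for
example,
\[
  p_\alpha^U(1) = \alpha - \tfrac{1}{2}\alpha^2, \qquad
  p_\alpha^U(2) = 2e^{-\alpha} - 2 + 2\alpha - \tfrac{1}{2}\alpha^2,
\]
\[
  p_\alpha^L(1) = \tfrac{1}{2}\alpha - \tfrac{1}{6}\alpha^2, \qquad
  p_\alpha^L(2) = 2\,\frac{1 - e^{-\alpha}}{\alpha} - 2 + \alpha - \tfrac{1}{6}\alpha^2,
\]
\[
  p_\alpha^E(1) = \frac{e^{-\alpha} - 1 + \alpha}{\alpha}, \qquad
  p_\alpha^E(2) = \frac{1}{2} - \frac{1 - e^{-\alpha}}{\alpha} + \frac{1 - e^{-2\alpha}}{4\alpha}.
\]
No simple pattern is seen, and we leave further investigation to the reader.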
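
As a worked instance of these evaluations (a routine check added for illustration, using only the definitions in Theorem 2.1), the case $k = 1$ for U and E reads:

```latex
\begin{align*}
  p_\alpha^U(1) &= \int_0^\alpha (1 - \alpha + t)\,dt
     = \alpha(1 - \alpha) + \tfrac{1}{2}\alpha^2
     = \alpha - \tfrac{1}{2}\alpha^2, \\
  p_\alpha^E(1) &= \frac{1}{\alpha}\int_0^\alpha (\alpha - t)\,e^{-t}\,dt
     = \frac{1}{\alpha}\Bigl[(t - \alpha + 1)\,e^{-t}\Bigr]_0^\alpha
     = \frac{e^{-\alpha} - 1 + \alpha}{\alpha}.
\end{align*}
```

For $\alpha = 1$ these give $1/2$ and $e^{-1} \approx 0.3679$, in agreement with Table 1.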

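Numerical values for $p_\alpha^\Xi(0), \dots, p_\alpha^\Xi(10)$, the tail $\sum_{k \ge 11} p_\alpha^\Xi(k)$, the mean $\mathbb{E} D_\alpha^\Xi$
and the variance $\operatorname{Var} D_\alpha^\Xi$ are given for $\alpha = 0.5$ and 1 (half-full and full tables)
in Table 1.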



4. Proofs 

To prove the theorems, we randomize the times the items are inserted 
in the table by Poissonization: We assume that items with hash address 
i arrive according to a Poisson process with intensity 1, the m different 
Poisson processes being independent. We let $T(t)$ denote the hash table at
time $t$, when there are $\mathrm{Po}(t)$ items with each hash address. For simplicity,
we write $n_k^\Xi(t) := n_k^\Xi(T(t))$.

  k      p^U_{0.5}(k)  p^L_{0.5}(k)  p^E_{0.5}(k)   p^U_1(k)   p^L_1(k)   p^E_1(k)
  0      0.5           0.75          0.75           0.0        0.5        0.5
  1      0.375         0.2083        0.2130         0.5        0.3333     0.3679
  2      0.0881        0.0322        0.0291         0.2358     0.0976     0.0840
  3      0.0252        0.0070        0.0059         0.1200     0.0376     0.0280
  4      0.0078        0.0018        0.0014         0.0638     0.0163     0.0110
  5      0.0026        0.00049       0.00038        0.0349     0.0076     0.0048
  6      0.00086       0.00014       0.00011        0.0194     0.0037     0.0022
  7      0.00030       0.000043      0.000032       0.0110     0.0019     0.0011
  8      0.00010       0.000014      0.000010       0.0063     0.0010     0.0005
  9      0.00004       0.000004      0.000003       0.0036     0.0005     0.0003
  10     0.00001       0.000001      0.000001       0.0021     0.0003     0.0001
  >= 11  0.000007      0.0000007     0.0000005      0.0031     0.0003     0.0002
  E      0.6796        0.3046        0.2974         2.0973     0.7986     0.7183
  Var    0.7394        0.3565        0.3324         2.6533     1.2799     0.9603

Table 1. Some numerical values



Combining the $m$ individual Poisson processes, we see that the items
$x_1, x_2, \dots$ arrive according to a Poisson process with intensity $m$; we call
the arrival times $\tau_1, \tau_2, \dots$ (we may assume that these are distinct). We
really have to stop at $\tau_m$, since the table then is full, but it is convenient to
think of the hashing as continuing for ever, with the chains growing into a
virtual, infinitely large attic; no new chains are created after $\tau_m$.

The hash addresses of the items $x_1, x_2, \dots$ are independent and uniformly
distributed, so except for the random time scale, this is the situation we want
to study. More precisely, $T(\tau_n) = T_{m,n}$ for $0 \le n \le m$.

Note that $\tau_m \approx 1$; more precisely, $\tau_m \xrightarrow{p} 1$ as $m \to \infty$, as shown in
Lemma 4.3 below.

We will consider stochastic processes defined on $[0, \infty)$ (although we
mainly are interested in $0 \le t \le 1$). We say that such a process $X(t)$
is increasing if $X(s) \le X(t)$ whenever $s \le t$. We let $\xrightarrow{ucp}$ denote
convergence uniformly on compacts in probability (ucp), i.e. $X_n \xrightarrow{ucp} X$ if
$\sup_{0 \le t \le u} |X_n(t) - X(t)| \xrightarrow{p} 0$ for every $u > 0$.

Lemma 4.1. Let, for each $n$, $X_n(t)$, $t \ge 0$, be an increasing stochastic
process, and let $f(t)$ be a continuous function on $[0, \infty)$. If $X_n(t) \xrightarrow{p} f(t)$
for every $t \ge 0$, then $X_n(t) \xrightarrow{ucp} f(t)$.

Proof. Fix $u > 0$. Let $\varepsilon > 0$, and let $K$ be so large that if $\delta := u/K$, then
$|f(s) - f(t)| < \varepsilon$ if $|s - t| \le \delta$ and $0 \le s \le t \le u$. Since each $X_n$ is increasing,
the limit $f(t)$ is too. Hence, if $(k-1)\delta \le t \le k\delta$,
\[
  X_n((k-1)\delta) - f((k-1)\delta) - \varepsilon \le X_n((k-1)\delta) - f(k\delta)
  \le X_n(t) - f(t) \le X_n(k\delta) - f((k-1)\delta) \le X_n(k\delta) - f(k\delta) + \varepsilon,
\]
and, consequently,
\[
  \sup_{0 \le t \le u} |X_n(t) - f(t)| \le \sup_{0 \le k \le K} |X_n(k\delta) - f(k\delta)| + \varepsilon.
\]
We know that $X_n(t) - f(t) \xrightarrow{p} 0$ for every $t \ge 0$. We apply this for $t = k\delta$,
$k = 0, \dots, K$, and find that whp (i.e., with probability $\to 1$ as $n \to \infty$)
$|X_n(k\delta) - f(k\delta)| < \varepsilon$ for $k = 0, \dots, K$, and thus $\sup_{0 \le t \le u} |X_n(t) - f(t)| < 2\varepsilon$.

Since $\varepsilon > 0$ is arbitrary, $\sup_{0 \le t \le u} |X_n(t) - f(t)| \xrightarrow{p} 0$. □

Let $N(t)$ be the number of items that have arrived at time $t$. Since each
item is put into some empty cell, the number of empty cells at time $t$ is
$(m - N(t))_+$, i.e. $m - N(t)$ for $t \le \tau_m$ and then 0.

Lemma 4.2. As $m \to \infty$, $N(t)/m \xrightarrow{ucp} t$ for $t \ge 0$.

Proof. We have $N(t) \sim \mathrm{Po}(mt)$, and thus $N(t)/m \xrightarrow{p} t$ as $m \to \infty$ for every
$t \ge 0$. The convergence ucp follows from Lemma 4.1. □

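Lemma 4.3. Suppose that $m \to \infty$ and $n/m \to \alpha$, with $0 \le n \le m$. Then
$\tau_n \xrightarrow{p} \alpha$. Consequently, if $X_m \xrightarrow{ucp} X$ for some stochastic processes $X_m$ and
$X$, where $X(t)$ is continuous, then $X_m(\tau_n) \xrightarrow{p} X(\alpha)$.

Proof. $\tau_n$ is the sum of $n$ i.i.d. waiting times, each exponentially distributed
with mean $1/m$, so $\mathbb{E}\,\tau_n = n/m \to \alpha$ and $\operatorname{Var} \tau_n = n/m^2 \to 0$, whence
$\tau_n \xrightarrow{p} \alpha$ by Chebyshev's inequality. (Alternatively, this is a standard
consequence of Lemma 4.2: For $\varepsilon > 0$, $\mathbb{P}(\tau_n > \alpha + \varepsilon) \le \mathbb{P}(N(\alpha + \varepsilon)/m < n/m) \to 0$.
Similarly, $\mathbb{P}(\tau_n < \alpha - \varepsilon) \to 0$.)

The final assertion follows because whp $\tau_n \le \alpha + 1$, and then
$|X_m(\tau_n) - X(\alpha)| \le \sup_{s \le \alpha + 1} |X_m(s) - X(s)| + |X(\tau_n) - X(\alpha)| \xrightarrow{p} 0$. □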
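
The concentration of $\tau_n$ around $n/m$ is easy to observe numerically. The following is a small simulation sketch, not part of the proof; the function name `tau`, the seed and the sample sizes are arbitrary choices made for illustration.

```python
import random


def tau(n, m, rng):
    """Arrival time of the n-th item: a sum of n independent gaps,
    each exponential with rate m (mean 1/m)."""
    return sum(rng.expovariate(m) for _ in range(n))


rng = random.Random(0)
m, n = 10_000, 5_000
samples = [tau(n, m, rng) for _ in range(100)]
mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / (len(samples) - 1)
print(mean, var)   # compare with n/m and n/m**2
```

The sample mean should be close to $n/m = 0.5$ and the sample variance of the same order as $n/m^2 = 5 \cdot 10^{-5}$, which is the Chebyshev input used above.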

Chains. When an item arrives to an empty cell, a new chain of length 1
is created. The chain then grows one unit each time it is hit. Hence each
chain, once created, grows according to a birth process where the transition
$\ell \to \ell + 1$ has intensity $\ell$, and different chains grow independently. (In order
for this to hold for $t > \tau_m$ too, we may pretend that new items arrive also in
the attic, but are ignored unless they hit an existing chain. Similar ad hoc
modifications have to be made after $\tau_m$ for other quantities too, in order for
the arguments below to be valid; a simple possibility is to redefine $n_k^U(t)$ for
$t > \tau_m$ so that (4.8) holds and redefine the processes $Z$ and $W$ in (4.10) and
(4.12) to be constant for $t \ge \tau_m$. The details do not matter, since we will
later consider only $t = \tau_n \le \tau_m$, so we will ignore them.)

The growth process of each chain thus is the same as the Yule process, or
binary splitting, a branching process where each individual after a lifetime
distributed as $\mathrm{Exp}(1)$ splits into two. It is well-known, see e.g. [1, Section
III.5], that if we start a Yule process with a single particle at time 0, the
number of particles at time $t$ has the geometric distribution $\mathrm{Ge}(e^{-t})$ with
mean $e^t$ and
\[
  \mathbb{P}(k \text{ particles}) = e^{-t} (1 - e^{-t})^{k-1}, \qquad k \ge 1. \tag{4.1}
\]
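
The geometric law (4.1) can be checked against a direct simulation of the birth process. This is an illustrative sketch only; the names `yule_pmf` and `simulate_yule`, the seed and the number of runs are assumptions, not part of the paper.

```python
import math
import random


def yule_pmf(k, t):
    """P(k particles at time t) for a Yule process started with one particle."""
    p = math.exp(-t)
    return p * (1 - p) ** (k - 1)


def simulate_yule(t, rng):
    """Size at time t of a pure birth process with rate j in state j, started at 1."""
    size, clock = 1, 0.0
    while True:
        clock += rng.expovariate(size)   # waiting time to the next split
        if clock > t:
            return size
        size += 1


rng = random.Random(0)
runs = [simulate_yule(1.0, rng) for _ in range(20_000)]
empirical_mean = sum(runs) / len(runs)
empirical_p1 = runs.count(1) / len(runs)
print(empirical_mean, math.e, empirical_p1, yule_pmf(1, 1.0))
```

The same birth process describes the growth of a single chain, with `size` playing the role of the chain length.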

Let $C(T)$ be the number of chains in the hash table $T$, and denote their
lengths by $L_1(T), \dots, L_{C(T)}(T)$. For $T(t)$ we write $C(t)$ and $L_j(t)$.

Lemma 4.4. Let $C_\ell(t)$ be the number of chains of length $\ell$ in $T(t)$. Then,
for each $k \ge 1$ and $t \ge 0$, as $m \to \infty$,
\[
  \frac{1}{m} \sum_{\ell \ge k} C_\ell(t) \xrightarrow{ucp}
  \int_0^{t \wedge 1} (1 - s)\bigl(1 - e^{-(t-s)}\bigr)^{k-1}\,ds. \tag{4.2}
\]

Proof. Informally, we observe that in a tiny time interval $[s, s + ds]$, about
$m\,ds$ items arrive, and $(m - N(s))\,ds \approx m(1 - s)\,ds$ of them create new
chains, for $s < 1$. Of these chains, by (4.1), a proportion $(1 - e^{-(t-s)})^{k-1}$
have grown to length at least $k$ at time $t$, and the result follows by integration
over $s$.

To be more formal, let, for $k \ge 1$, $u \ge 0$ and $j \ge 1$, $f_u^{(k)}(j)$ be the
probability that a Yule process that starts with $j$ particles at time 0 has
reached at least $k$ particles at time $u$; this is thus equal to the probability
that a chain of length $j$ at some instance $s$ grows to length at least $k$ at time
$s + u$. We have $f_u^{(k)}(j) = 1$ for $j \ge k$ and all $u$, $f_0^{(k)}(j) = \mathbf{1}[j \ge k]$, and, by
(4.1),
\[
  f_u^{(k)}(1) = (1 - e^{-u})^{k-1}. \tag{4.3}
\]
For fixed $k$ and $t$, and $0 \le s \le t$, let
\[
  X(s) := \sum_{\ell \ge 1} C_\ell(s)\, f_{t-s}^{(k)}(\ell) = \sum_{j=1}^{C(s)} f_{t-s}^{(k)}(L_j(s)).
\]
$X(s)$ is thus the expected number (given $T(s)$) of the chains present at $s$
that have grown to length at least $k$ at $t$. In particular, $X(t) = \sum_{\ell \ge k} C_\ell(t)$.

Since new chains, all of length 1, are created with the rate $(m - N(s))_+$,
it follows that the process
\[
  Y(s) := X(s) - \int_0^s f_{t-u}^{(k)}(1)\,(m - N(u))_+\,du, \qquad 0 \le s \le t, \tag{4.4}
\]
is a martingale. Moreover, $Y(s)$ has a jump $\Delta Y(s) = f_{t-s}^{(k)}(1)$ of size
$|\Delta Y(s)| \le 1$ each time a new chain is created, and $Y(s)$ is smooth with
a bounded derivative between the jumps, so $s \mapsto Y(s)$ is of finite variation.
Hence, see e.g. Protter [7, II.6], the quadratic variation $[Y, Y]_t =
\sum_{s \le t} \Delta Y(s)^2 \le N(t)$, and, observing $Y(0) = X(0) = 0$,
\[
  \mathbb{E}\,Y(t)^2 = \mathbb{E}\,[Y, Y]_t \le \mathbb{E}\,N(t) = mt.
\]
In particular, as $m \to \infty$, $Y(t)/m \xrightarrow{p} 0$ by Chebyshev's inequality, i.e.
\[
  \frac{X(t)}{m} - \int_0^t f_{t-u}^{(k)}(1)\Bigl(1 - \frac{N(u)}{m}\Bigr)_+ du \xrightarrow{p} 0.
\]
Combined with Lemma 4.2 this shows
\[
  \frac{\sum_{\ell \ge k} C_\ell(t)}{m} = \frac{X(t)}{m} \xrightarrow{p} \int_0^t f_{t-u}^{(k)}(1)\,(1 - u)_+\,du,
\]
which by (4.3) proves (4.2) for fixed $t \ge 0$. Convergence ucp follows by
Lemma 4.1. □

Lemma 4.5. (i) For every $t \ge 0$ and $r \ge 0$, there exists a constant $K(t, r)$,
not depending on $m$, such that
\[
  \mathbb{E} \sum_{\ell=1}^{\infty} \ell^r C_\ell(t) = \mathbb{E} \sum_{j=1}^{C(t)} L_j(t)^r \le K(t, r)\,m.
\]
(ii) For every $r \ge 0$, there exists a constant $K(r)$, not depending on $m$
or $n$, such that
\[
  \mathbb{E} \sum_{j=1}^{C(T_{m,n})} L_j(T_{m,n})^r \le K(r)\,n.
\]

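Proof. (i): Since $Y(s)$ in (4.4) is a martingale with $Y(0) = 0$, we have
$\mathbb{E}\,Y(t) = 0$ and
\[
  \mathbb{E}\,C_k(t) \le \mathbb{E}\,X(t) = \mathbb{E} \int_0^t f_{t-u}^{(k)}(1)\,(m - N(u))_+\,du
  \le t\,f_t^{(k)}(1)\,m = t\,(1 - e^{-t})^{k-1}\,m.
\]
Hence, if $a < (1 - e^{-t})^{-1}$,
\[
  \mathbb{E} \sum_{\ell=1}^{\infty} a^\ell C_\ell(t) \le a\,m\,t \sum_{\ell=1}^{\infty} \bigl(a(1 - e^{-t})\bigr)^{\ell-1}
  = \frac{a\,t}{1 - a(1 - e^{-t})}\,m. \tag{4.5}
\]
Taking e.g. $a = 1 + e^{-t}$, the result follows, since $\sup_\ell \ell^r / a^\ell < \infty$.

(ii): Since $\sum_j a^{L_j(t)}$ is increasing in $t$ and $T(\tau_n) = T_{m,n}$ is independent of $\tau_n$,
we have for every $t \ge 0$ and $a \ge 1$
\[
  \mathbb{E} \sum_j a^{L_j(t)} \ge \mathbb{E}\Bigl(\sum_j a^{L_j(\tau_n)}\,\mathbf{1}[\tau_n \le t]\Bigr)
  = \mathbb{E}\Bigl(\sum_j a^{L_j(T_{m,n})}\Bigr)\,\mathbb{P}(\tau_n \le t).
\]
Choose $t := 2n/m \le 2$ and $a := 1 + e^{-2}$. Then, by (4.5),
\[
  \mathbb{E} \sum_j a^{L_j(t)} \le e^4\,a\,t\,m = 2e^2(e^2 + 1)\,n.
\]
Moreover, $N(t) \sim \mathrm{Po}(mt) = \mathrm{Po}(2n)$, and thus
\[
  \mathbb{P}(\tau_n \le t) = \mathbb{P}(N(t) \ge n) = \mathbb{P}(\mathrm{Po}(2n) \ge n) \to 1, \qquad \text{as } n \to \infty;
\]
hence, for some constant $c > 0$ and all $n \ge 1$, $\mathbb{P}(\tau_n \le t) \ge c$. Consequently,
\[
  \mathbb{E}\Bigl(\sum_j a^{L_j(T_{m,n})}\Bigr) \le \frac{2e^2(e^2 + 1)}{c}\,n,
\]
and the result follows. □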

Remark 4.6. The collection of chain lengths evolves as a generalized Pólya
urn with balls of infinitely many types $0, 1, \dots$; we regard each empty cell as
a ball of type 0 and each cell in a chain of length $\ell$ as a ball of type $\ell$. The
dynamics of the urn thus is that if a ball of type 0 is drawn, it is removed
and replaced by a ball of type 1; if a ball of type $\ell \ge 1$ is drawn, $\ell$ balls of
type $\ell$ are removed together with one ball of type 0, and $\ell + 1$ balls of type
$\ell + 1$ are added. We start with $m$ balls of type 0. We will, however, not use
this urn representation.
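
Although the urn is not used in the proofs, its dynamics are simple to state in code. The sketch below is illustrative only: the dictionary representation and the names `draw_type` and `urn_step` are assumptions, and the urn is started from one ball of type 0 per cell of the table.

```python
import random


def draw_type(counts, rng):
    """Draw a ball uniformly at random and return its type."""
    r = rng.randrange(sum(counts.values()))
    for ball_type in sorted(counts):
        if r < counts[ball_type]:
            return ball_type
        r -= counts[ball_type]
    raise AssertionError("unreachable: r is below the total number of balls")


def urn_step(counts, rng):
    """One insertion: counts[l] is the number of balls (cells) of type l."""
    l = draw_type(counts, rng)
    if l == 0:                     # empty cell hit: it becomes a chain of length 1
        counts[0] -= 1
        counts[1] = counts.get(1, 0) + 1
    else:                          # chain of length l hit: it absorbs one empty cell
        counts[l] -= l
        counts[0] -= 1
        counts[l + 1] = counts.get(l + 1, 0) + l + 1


rng = random.Random(0)
m, n = 200, 150
counts = {0: m}
for _ in range(n):
    urn_step(counts, rng)
```

Each step removes exactly one ball of type 0, so the total number of balls stays equal to the number of cells and `counts[0]` equals the number of empty cells after `n` insertions.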

U. In an unsuccessful search starting at address $j$ in a hash table $T$, the
number $d_j^U(T)$ of searched occupied cells is 0 if the cell $j$ is empty; otherwise
the cell belongs to a chain, and $d_j^U$ equals 1 + the number of cells in the
chain after $j$.

Hence, $n_0^U(T)$ is the number of empty cells in $T$, and for $T(t)$,
\[
  n_0^U(t) = (m - N(t))_+. \tag{4.6}
\]
By Lemma 4.2 thus
\[
  m^{-1} n_0^U(t) \xrightarrow{ucp} p_t^U(0) := (1 - t)_+. \tag{4.7}
\]
(We see also that $n_0^U(T_{m,n}) = m - n$, directly proving the case $k = 0$ in
Theorems 2.1(i) and 2.5(i).)

For $k \ge 1$, there is exactly one cell with $d_j^U = k$ in each chain of length $\ell \ge k$,
and thus, for $T(t)$,
\[
  n_k^U(t) = \sum_{\ell \ge k} C_\ell(t). \tag{4.8}
\]
Consequently, Lemma 4.4 yields, for $k \ge 1$, using $u = t - s$,
\[
  \frac{1}{m}\,n_k^U(t) \xrightarrow{ucp} p_t^U(k) := \int_0^{t \wedge 1} (1 - s)\bigl(1 - e^{-(t-s)}\bigr)^{k-1}\,ds
  = \int_{(t-1)_+}^{t} (1 - t + u)(1 - e^{-u})^{k-1}\,du. \tag{4.9}
\]
Theorem 2.5(i) follows by Lemma 4.3.

L. For the standard (late-insertion) method L, when a new item arrives with
a hash address $j$, the insertion algorithm begins with an unsuccessful search
for the item (followed by finding an empty cell). The displacement of the new
item is thus the same as the number $d_j^U$ for an unsuccessful search starting at
$j$; note that for L, the displacement never changes after the item is inserted.
Consequently, for $T(t)$, new items with displacement $k$ are created at the
rate $n_k^U(t)$, and
\[
  Z(t) := n_k^L(t) - \int_0^t n_k^U(s)\,ds \tag{4.10}
\]
is a martingale. The jumps are all 1, and we have, as for $Y$ above, $[Z, Z]_t \le
N(t)$, and $m^{-1} Z(t) \xrightarrow{p} 0$. Moreover, by Doob's inequality, see e.g. [7, p.
11],
\[
  \mathbb{E}\Bigl(\sup_{s \le t} Z(s)^2\Bigr) \le 4\,\mathbb{E}\,Z(t)^2 = 4\,\mathbb{E}\,[Z, Z]_t \le 4\,\mathbb{E}\,N(t) = 4mt,
\]
and hence $m^{-1} Z(t) \xrightarrow{ucp} 0$. Consequently, by (4.10) and (4.9),
\[
  m^{-1} n_k^L(t) \xrightarrow{ucp} \int_0^t p_s^U(k)\,ds.
\]
For $\alpha > 0$ we multiply by $m/n \to \alpha^{-1}$ and find by Lemma 4.3
\[
  n^{-1} n_k^L(\tau_n) \xrightarrow{p} p_\alpha^L(k) := \alpha^{-1} \int_0^\alpha p_s^U(k)\,ds \tag{4.11}
\]
as asserted in Theorem 2.5(ii). Explicitly we have, for $0 < \alpha \le 1$,
\[
  p_\alpha^L(0) = \alpha^{-1} \int_0^\alpha (1 - s)\,ds = 1 - \alpha/2
\]
and, for $k \ge 1$,
\[
  p_\alpha^L(k) = \alpha^{-1} \int_{s=0}^{\alpha} \int_{t=0}^{s} (1 - s + t)(1 - e^{-t})^{k-1}\,dt\,ds
  = \alpha^{-1} \int_{t=0}^{\alpha} \int_{s=t}^{\alpha} (1 - s + t)(1 - e^{-t})^{k-1}\,ds\,dt
\]
\[
  = \alpha^{-1} \int_{t=0}^{\alpha} \bigl(\alpha - t - (\alpha - t)^2/2\bigr)(1 - e^{-t})^{k-1}\,dt.
\]
For $\alpha = 0$, we observe that $\mathbb{P}(d_i^L(T_{m,n}) \ne 0) \le (i - 1)/m$, and thus
\[
  \mathbb{E}\,|n - n_0^L(T_{m,n})| \le \sum_{i=1}^{n} \frac{i - 1}{m} < n^2/m;
\]
hence $\mathbb{E}\,|1 - n_0^L(T_{m,n})/n| \le n/m \to \alpha = 0$. This yields Theorem 2.5(ii) for
$\alpha = 0$ with $p_0^L(0) = 1$ and $p_0^L(k) = 0$, $k \ge 1$.

E. For the early-insertion method E, a new item that hashes to an empty
cell gets displacement 0, which remains unchanged for ever. Hence $n_0^E(T) =
n_0^L(T)$, and
\[
  n^{-1} n_0^E(T_{m,n}) = n^{-1} n_0^L(T_{m,n}) \xrightarrow{p} p_\alpha^L(0) = 1 - \alpha/2
\]
by the preceding case.

An item $x$ that hashes to an occupied cell gets an initial displacement
1, and this displacement increases each time a new item hashes to one of
the cells in the subchain beginning with the hash address of $x$ and ending
just before $x$; the number of such cells is the displacement, and thus the
displacement grows according to the same Yule process as the chains. Fix $t$
and $k$, let $f_u^{(k)}(j)$ be as in the proof of Lemma 4.4 and let, cf. (4.4),
\[
  W(s) := \sum_{\ell \ge 1} n_\ell^E(s)\,f_{t-s}^{(k)}(\ell) - \int_0^s f_{t-u}^{(k)}(1)\,N(u)\,du, \qquad 0 \le s \le t. \tag{4.12}
\]
Again, this is a martingale. This time, however, the jumps may be larger
than 1, since more than one item can get its displacement increased when a
new item is inserted. Clearly, the jump $\Delta W$ when a new item is inserted is
at most the length of the chain where the new item was inserted, since only
the items in this chain can have their displacements changed. Since $[W, W]_t$
equals the sum of the squares of all jumps up to $t$, it is at most the sum of
the squares of the lengths of all chains that have existed during the process.
A chain in $T(t)$ of length $\ell$ is formed by $\ell$ insertions, and their contribution
to the latter sum is $\sum_{i=1}^{\ell} i^2 \le \ell^3$. Hence,
\[
  [W, W]_t = \sum_{s \le t} |\Delta W(s)|^2 \le \sum_{j=1}^{C(t)} L_j(t)^3 = \sum_{\ell=1}^{\infty} \ell^3 C_\ell(t),
\]
and Lemma 4.5 shows that
\[
  \mathbb{E}\,W(t)^2 = \mathbb{E}\,[W, W]_t \le K(t, 3)\,m.
\]
Hence, as above, $m^{-1} W(t) \xrightarrow{p} 0$, which together with Lemma 4.2 and (4.3)
yields
\[
  m^{-1} \sum_{\ell \ge k} n_\ell^E(t) - \int_0^t \bigl(1 - e^{-(t-u)}\bigr)^{k-1} u\,du \xrightarrow{p} 0.
\]
By Lemma 4.1 thus, with $s = t - u$,
\[
  m^{-1} \sum_{\ell \ge k} n_\ell^E(t) \xrightarrow{ucp} \int_0^t (1 - e^{-s})^{k-1} (t - s)\,ds.
\]
Replacing $k$ by $k + 1$ and subtracting, we find
\[
  m^{-1} n_k^E(t) \xrightarrow{ucp} \int_0^t (1 - e^{-s})^{k-1} e^{-s} (t - s)\,ds,
\]
and Theorem 2.5(iii) follows by Lemma 4.3 when $\alpha > 0$.

If $\alpha = 0$ we have $n_0^E(T_{m,n})/n = n_0^L(T_{m,n})/n \xrightarrow{p} 1$ by the case L, and thus
also $n_k^E(T_{m,n})/n \xrightarrow{p} 0$ for $k \ge 1$.

This completes the proof of Theorem 2.5.

Proof of Theorem 2.1. Theorem 2.1(i)-(iii) follow by taking expectations in
Theorem 2.5, using dominated convergence.

We verify directly that $\{p_\alpha^\Xi(k)\}_{k=0}^\infty$ is a probability distribution by
summing. For U we have, by (4.7) and (4.9) and summing the geometric series
$\sum_{k \ge 1} (1 - e^{-s})^{k-1} = e^s$,
\[
  \sum_{k=0}^{\infty} p_\alpha^U(k) = 1 - \alpha + \int_0^\alpha (1 - \alpha + s)\,e^s\,ds
  = 1 - \alpha + \bigl[(s - \alpha)\,e^s\bigr]_0^\alpha = 1.
\]
For L and E, the case $\alpha = 0$ is trivial. If $0 < \alpha \le 1$ we have by (4.11)
\[
  \sum_{k=0}^{\infty} p_\alpha^L(k) = \alpha^{-1} \int_0^\alpha \sum_{k=0}^{\infty} p_t^U(k)\,dt = \alpha^{-1} \int_0^\alpha dt = 1
\]
and, by the definition in Theorem 2.1,
\[
  \sum_{k=0}^{\infty} p_\alpha^E(k) = 1 - \frac{\alpha}{2} + \alpha^{-1} \int_0^\alpha (\alpha - t)\,dt = 1.
\]
Finally, note that each chain of length $\ell$ contributes (for L, E and U) $\ell$
displacements which are all at most $\ell$. Hence, for $r > 0$,
\[
  \mathbb{E}\bigl(d^\Xi(T_{m,n})^r \mid T_{m,n}\bigr) \le \frac{1}{n} \sum_j L_j(T_{m,n})^{r+1}
\]
(also for U), and thus, using Lemma 4.5,
\[
  \mathbb{E}\bigl(d^\Xi(T_{m,n})^r\bigr) \le n^{-1}\,\mathbb{E} \sum_j L_j(T_{m,n})^{r+1} \le K(r + 1).
\]
Replacing $r$ by $r + 1$ we see that the family $d^\Xi(T_{m,n})^r$ is uniformly integrable,
and thus (2.4) implies $\mathbb{E}(d^\Xi(T_{m,n}))^r \to \mathbb{E}(D_\alpha^\Xi)^r$. □

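Proof of Corollary 2.2. It only remains to compute the expectation of $D_\alpha^\Xi$.
\[
  \mathbb{E} D_\alpha^U = \sum_{k=1}^{\infty} k\,p_\alpha^U(k)
  = \int_0^\alpha (1 - \alpha + t) \sum_{k=1}^{\infty} k (1 - e^{-t})^{k-1}\,dt
  = \int_0^\alpha (1 - \alpha + t)\bigl(1 - (1 - e^{-t})\bigr)^{-2}\,dt
\]
\[
  = \int_0^\alpha (1 - \alpha + t)\,e^{2t}\,dt
  = \Bigl[\Bigl(\frac{t}{2} - \frac{\alpha}{2} + \frac{1}{4}\Bigr) e^{2t}\Bigr]_0^\alpha
  = \frac{1}{4} e^{2\alpha} + \frac{\alpha}{2} - \frac{1}{4},
\]
and, similarly, (for $\alpha > 0$)
\[
  \mathbb{E} D_\alpha^L = \frac{1}{\alpha} \int_0^\alpha \Bigl(\alpha - t - \frac{(\alpha - t)^2}{2}\Bigr) e^{2t}\,dt
  = \frac{1}{\alpha} \Bigl[\Bigl(-\frac{(\alpha - t)^2}{4} + \frac{\alpha - t}{4} + \frac{1}{8}\Bigr) e^{2t}\Bigr]_0^\alpha
  = \frac{e^{2\alpha}}{8\alpha} + \frac{\alpha}{4} - \frac{1}{4} - \frac{1}{8\alpha},
\]
\[
  \mathbb{E} D_\alpha^E = \frac{1}{\alpha} \int_0^\alpha (\alpha - t)\,e^{t}\,dt
  = \alpha^{-1} \bigl[(\alpha - t + 1)\,e^{t}\bigr]_0^\alpha = \alpha^{-1}\bigl(e^{\alpha} - 1 - \alpha\bigr).
\]
□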
Proof of Corollary 2.4. Similar to Corollary 2.2; we omit the details. (We
used Maple to perform the integrations.) For the final part, note that
\[
  \operatorname{Var}(\tilde D_\alpha^U) = \mathbb{E}\bigl((D_\alpha^U)^2\bigr) + 1 - \alpha - \bigl(\mathbb{E}(D_\alpha^U) + 1 - \alpha\bigr)^2
  = \operatorname{Var}(D_\alpha^U) - 2(1 - \alpha)\,\mathbb{E}(D_\alpha^U) + \alpha - \alpha^2.
\]

Proof of Theorem 2.7. An immediate consequence of Theorems 2.1 and 2.5
and the following general probabilistic lemma (with $X_n = d^\Xi(T_{m,n})^r$, $Y_n =
T_{m,n}$, $Z = (D_\alpha^\Xi)^r$). □

Lemma 4.7. Let $X_n$, $Y_n$ and $Z$ be random variables (with $X_n$ and $Y_n$ defined
on the same probability space, and where $Y_n$ may take values in any measure
space), such that $X_n \ge 0$ and $Z \ge 0$, and, for every real $x$, as $n \to \infty$,
\[
  \mathbb{P}(X_n \le x \mid Y_n) \xrightarrow{p} \mathbb{P}(Z \le x). \tag{4.13}
\]
Suppose further $\mathbb{E} X_n \to \mathbb{E} Z$. Then $\mathbb{E}(X_n \mid Y_n) \xrightarrow{p} \mathbb{E} Z$.

Proof. Note first that, for every real $x$, by (4.13) and dominated convergence,
\[
  \mathbb{P}(X_n \le x) = \mathbb{E}\bigl(\mathbb{P}(X_n \le x \mid Y_n)\bigr) \to \mathbb{P}(Z \le x),
\]
and thus $X_n \xrightarrow{d} Z$.

For any fixed $K > 0$ we thus have $X_n \wedge K \xrightarrow{d} Z \wedge K$ and thus, by
dominated convergence again, $\mathbb{E}(X_n \wedge K) \to \mathbb{E}(Z \wedge K)$. Hence also
\[
  \mathbb{E}(X_n - K)_+ = \mathbb{E}(X_n - X_n \wedge K) = \mathbb{E}(X_n) - \mathbb{E}(X_n \wedge K)
  \to \mathbb{E}(Z) - \mathbb{E}(Z \wedge K) = \mathbb{E}(Z - K)_+. \tag{4.14}
\]
Moreover,
\[
  \bigl|\mathbb{E}(X_n \wedge K \mid Y_n) - \mathbb{E}(Z \wedge K)\bigr|
  = \Bigl|\int_0^K \mathbb{P}(X_n > x \mid Y_n)\,dx - \int_0^K \mathbb{P}(Z > x)\,dx\Bigr|
  \le \int_0^K \bigl|\mathbb{P}(X_n > x \mid Y_n) - \mathbb{P}(Z > x)\bigr|\,dx
\]
and thus, by (4.13) and, yet again, dominated convergence (twice),
\[
  \mathbb{E}\,\bigl|\mathbb{E}(X_n \wedge K \mid Y_n) - \mathbb{E}(Z \wedge K)\bigr|
  \le \int_0^K \mathbb{E}\,\bigl|\mathbb{P}(X_n > x \mid Y_n) - \mathbb{P}(Z > x)\bigr|\,dx \to 0. \tag{4.15}
\]
By the triangle inequality and $X = X \wedge K + (X - K)_+$,
\[
  \bigl|\mathbb{E}(X_n \mid Y_n) - \mathbb{E}(Z)\bigr| \le \mathbb{E}\bigl((X_n - K)_+ \mid Y_n\bigr)
  + \bigl|\mathbb{E}(X_n \wedge K \mid Y_n) - \mathbb{E}(Z \wedge K)\bigr| + \mathbb{E}(Z - K)_+.
\]
Taking expectations, we see by (4.15) and (4.14) that
\[
  \limsup \mathbb{E}\,\bigl|\mathbb{E}(X_n \mid Y_n) - \mathbb{E}(Z)\bigr| \le \limsup \mathbb{E}(X_n - K)_+ + \mathbb{E}(Z - K)_+
  = 2\,\mathbb{E}(Z - K)_+.
\]
Since $K$ is arbitrary, we see by letting $K \to \infty$ that the left hand side is 0,